A two-day workshop in 3 hours
Mine Çetinkaya-Rundel, Duke University
Colin Rundel, Duke University
François Michonneau, University of Florida
Tracy Teal, Data Carpentry
Lack of reproducibility in science causes significant issues
Science retracted (without lead author’s consent) a study of how canvassers can sway people’s opinions about gay marriage
Original survey data was not made available for independent reproduction of results (and survey incentives were misrepresented, and sponsorship statements were false)
Two Berkeley grad students attempted to replicate the study and discovered serious problems with the data, which were likely fabricated, and worked out how the fabrication was done.
Reproducible science accelerates scientific progress.
Methods are codified by definition, yet still challenging to reproduce
See an experiment on reproducing reproducible computational research
Day 1

* Motivation for and introduction to Reproducible Research
* Best practices for file naming and file organization
* Best practices for tabular data
* Literate programming and executable documentation of data modification

Day 2

* Version control and Git
* Why automate?
* Transforming repetitive R script code into R functions
* Automated testing and integration testing
* Sharing, publishing, and archiving for data and code
This is a two-part exercise:
Part 1: Analyze + document
Part 2: Swap + discuss
Complete the following task and write instructions / documentation for your collaborator to reproduce your work starting with the original dataset (data/gapminder-5060.csv).
Download material: http://bit.ly/2sEPe4z
Visualize life expectancy over time for Canadians in the 1950s and 1960s using a line plot.
Something should be clearly wrong with your plot, figure out (and document) what this is and come up with a fix that resolves this issue.
Visualize life expectancy over time for Canadians again, with the corrected data.
Stretch goal: Add additional lines for the life expectancy of Mexicans and Americans as well.
Introduce yourself to your collaborator (neighbor).
If your collaborator/neighbor does not have, or is unfamiliar with, the software you used, we encourage you to give them a brief explanation of what it is and why you chose it. (Remember, this could be part of the irreproducibility problem!)
This exercise:

- What tools did you use (Excel, R / Python, Word / plain text, etc.)?
- What made it easy or hard to reproduce your partner’s work?

In a “real life” setting:

- What would happen if your colleague/collaborator were no longer available to walk you through their analysis?
- What would have to happen if you
  - had to swap out the dataset or extend the analysis?
  - caught further errors and had to re-create the analysis?
  - had to revert to the original dataset?
Organization: tools to organize your projects so that you don’t have a single folder with hundreds of files
Automation: the power of scripting to create automated data analyses
Dissemination: publishing is not the end of your analysis, rather it is a way station towards your future research and the future research of others
Provenance with results pasted into manuscript:
Life expectancy shouldn’t exceed even the most extreme age observed for humans.
if (any(gap_5060$lifeExp > 150)) {
    stop("improbably high life expectancies")
}
Error in eval(expr, envir, enclos): improbably high life expectancies
The testthat package lets us make this a little more readable:
library(testthat)
expect_false(any(gap_5060$lifeExp > 150),
             "improbably high life expectancies")
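When a check like this fails, it usually helps to see which rows triggered it. The following is a minimal sketch in base R, assuming a data frame with `country`, `year` and `lifeExp` columns. The helper name `find_implausible()`, the bounds, and the toy values are illustrative assumptions, not part of the workshop material.

```r
# Hypothetical helper: return the rows whose life expectancy is missing
# or falls outside a plausible range (the bounds are assumptions).
find_implausible <- function(df, lower = 0, upper = 150) {
  bad <- is.na(df$lifeExp) | df$lifeExp < lower | df$lifeExp > upper
  df[bad, , drop = FALSE]
}

# Toy data, made up for illustration; this is not the workshop dataset.
toy <- data.frame(
  country = c("A", "A", "B"),
  year    = c(1952, 1957, 1952),
  lifeExp = c(68.7, 999999, 50.8)
)

suspect <- find_implausible(toy)
print(suspect)
```

Returning the offending rows rather than only stopping with an error makes it easier to document what was wrong and how it was fixed.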
- There are going to be files. Lots of files.
- They will change over time.
- They will have differing relationships to each other.
File organization and naming are effective weapons against chaos.
Your data files contain readings from a well plate, one file per well, using a specific assay run on a certain date, after a certain treatment.
$ ls *Plsmd*
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A01.csv
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A02.csv
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A03.csv
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B01.csv
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B02.csv
...
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_H03.csv
> list.files(pattern = "Plsmd") %>% head
[1] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A01.csv
[2] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A02.csv
[3] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A03.csv
[4] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B01.csv
[5] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B02.csv
[6] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B03.csv
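## collect the file names, then split each one on "_" and "."
flist <- list.files(pattern = "Plsmd")
meta <- stringr::str_split_fixed(flist, "[_\\.]", 5)
colnames(meta) <- c("date", "assay", "experiment",
                    "well", "ext")
meta[, 1:4]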
     date         assay       experiment            well
[1,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "A01"
[2,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "A02"
[3,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "A03"
[4,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "B01"
[5,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "B02"
[6,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "B03"
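The same split can be done without extra packages. The following is a sketch in base R. It uses a hard-coded vector of three of the file names listed above in place of `list.files()`, so that it runs anywhere. `parse_names()` is a hypothetical helper name, and it assumes every file name has exactly four underscore-separated parts.

```r
# Hypothetical helper: turn structured file names into a metadata table.
# Assumes names of the form date_assay_experiment_well.ext
parse_names <- function(files) {
  # Drop the extension, then split the stem on underscores.
  stems <- tools::file_path_sans_ext(basename(files))
  parts <- do.call(rbind, strsplit(stems, "_", fixed = TRUE))
  meta <- as.data.frame(parts, stringsAsFactors = FALSE)
  names(meta) <- c("date", "assay", "experiment", "well")
  meta$date <- as.Date(meta$date)  # ISO 8601 dates parse directly
  meta$file <- files
  meta
}

flist <- c(
  "2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A01.csv",
  "2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A02.csv",
  "2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B01.csv"
)
meta <- parse_names(flist)

# Because the well is its own column, selecting plate row "A" is a
# simple filter rather than a manual search through file names.
row_a <- meta[substr(meta$well, 1, 1) == "A", ]
```

Keeping the delimiter convention strict (underscores between fields, hyphens within a field) is what makes this kind of parsing reliable.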
Noble, William Stafford. 2009. “A Quick Guide to Organizing Computational Biology Projects.” PLoS Computational Biology 5 (7): e1000424.
|
+-- data-raw/
| |
| +-- gapminder-5060.csv
| +-- gapminder-7080.csv
| +-- ....
|
+-- data-output/
|
+-- fig/
|
+-- R/
| |
| +-- figures.R
| +-- data.R
| +-- utils.R
| +-- dependencies.R
|
+-- tests/
|
+-- manuscript.Rmd
+-- make.R
- data-raw: the original data; you shouldn’t edit or otherwise alter any of the files in this folder.
- data-output: intermediate datasets that will be generated by the analysis.
- fig: the folder where we can store the figures used in the manuscript.
- R: our R code (the functions).
- tests: the code to test that our functions are behaving properly and that all our data is included in the analysis.
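The layout above can be created by hand or scripted. The following is a minimal sketch in base R. `create_skeleton()` is a hypothetical helper, not something the workshop provides, and the example builds the folders in a temporary directory so that no real project is touched.

```r
# Hypothetical helper: create the folder layout described above under `root`.
create_skeleton <- function(root) {
  dirs <- file.path(root, c("data-raw", "data-output", "fig", "R", "tests"))
  for (d in dirs) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  # Create empty placeholders for the top-level files, but only if they
  # are missing: file.create() would truncate an existing file.
  files <- file.path(root, c("manuscript.Rmd", "make.R"))
  file.create(files[!file.exists(files)])
  invisible(dirs)
}

# Try it in a temporary directory.
project <- file.path(tempdir(), "my-project")
created <- create_skeleton(project)
list.files(project)
```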
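## render the manuscript; invisibly report whether the HTML file exists
make_ms <- function() {
    rmarkdown::render("manuscript.Rmd",
                      "html_document")
    invisible(file.exists("manuscript.html"))
}
## delete the rendered manuscript
clean_ms <- function() {
    res <- file.remove("manuscript.html")
    invisible(res)
}
## rebuild everything, in dependency order
make_all <- function() {
    make_data()
    make_figures()
    make_tests()
    make_ms()
}
## remove all generated files
clean_all <- function() {
    clean_data()
    clean_figures()
    clean_ms()
}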
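`make_all()` and `clean_all()` call `make_data()` and `clean_data()`, which are not shown here. The following is one possible shape for them, offered as a hedged sketch only: the output file name `gapminder-combined.csv`, the default folder arguments, and the assumption that all raw files share the same columns are illustrative choices. The demonstration uses made-up values in a temporary directory.

```r
# Hypothetical sketch of make_data() / clean_data(); file names and
# behaviour are assumptions, not the workshop's actual implementation.
make_data <- function(raw_dir = "data-raw", out_dir = "data-output") {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(raw_dir, pattern = "\\.csv$", full.names = TRUE)
  # Read every raw file and stack them; assumes identical columns.
  combined <- do.call(rbind, lapply(files, read.csv))
  out_file <- file.path(out_dir, "gapminder-combined.csv")
  write.csv(combined, out_file, row.names = FALSE)
  invisible(out_file)
}

clean_data <- function(out_dir = "data-output") {
  # Only generated files are removed; data-raw is never touched.
  files <- list.files(out_dir, pattern = "\\.csv$", full.names = TRUE)
  invisible(file.remove(files))
}

# Demonstration in a temporary directory with made-up values.
root <- file.path(tempdir(), "make-demo")
raw <- file.path(root, "data-raw")
out <- file.path(root, "data-output")
dir.create(raw, recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(year = 1952, lifeExp = 60),
          file.path(raw, "a.csv"), row.names = FALSE)
write.csv(data.frame(year = 1972, lifeExp = 65),
          file.path(raw, "b.csv"), row.names = FALSE)

combined_file <- make_data(raw, out)
n_rows <- nrow(read.csv(combined_file))
clean_data(out)
```

Pairing each `make_*()` step with a `clean_*()` step that deletes only generated files keeps the raw data safe and makes a full rebuild from scratch cheap to test.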
testthat includes a function called test_dir that will run tests included in files in a given directory. We can use it to run all the tests in our tests/ folder.
test_dir("tests/")
Let’s turn it into a function, so that we can add functionality to it a little later. We will also save it in a file called make.R at the root of our working directory:
## add this to make.R
make_tests <- function() {
    test_dir("tests/")
}
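For `test_dir()` to find anything, the folder needs test files, and by default testthat looks for files whose names start with `test`. The following is a sketch of what such a file could contain. The file name `tests/test-data.R` and the toy data frame are assumptions for illustration; a real test would load the project's own data instead.

```r
library(testthat)

# Hypothetical contents of a file such as tests/test-data.R.
# Toy data stands in for the real dataset so the example runs on its own.
gap_toy <- data.frame(
  country = c("A", "A", "B"),
  year    = c(1952, 1957, 1952),
  lifeExp = c(68.7, 69.9, 50.8)
)

test_that("life expectancies are plausible", {
  expect_false(any(is.na(gap_toy$lifeExp)))
  expect_true(all(gap_toy$lifeExp > 0 & gap_toy$lifeExp < 150))
})

test_that("each country-year combination appears once", {
  expect_false(any(duplicated(gap_toy[, c("country", "year")])))
})
```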
Run on file names
Use informatively named files
2013-10-14_manuscriptFish.doc
2013-10-30_manuscriptFish.doc
2013-11-05_manusctiptFish_intitialRyanEdits.doc
2013-11-10_manuscriptFish.doc
2013-11-11_manuscriptFish.doc
2013-11-15_manuscriptFish.doc
2013-11-30_manuscriptFish.doc
2013-12-01_manuscriptFish.doc
2013-12-02_manuscriptFish_PNASsubmitted.doc
2014-01-03_manuscriptFish_PLOSsubmitted.doc
2014-02-15_manuscriptFish_PLOSrevision.doc
2014-03-14_manuscriptFish_PLOSpublished.doc
Or zip the entire directory of your project files every time you make a change, and save it with the date
Use a version control system (e.g., Git)
Why use Git?
Features of a hosting service like GitHub
Piwowar & Vision (2013) “Data reuse and the open data citation advantage.” PeerJ, e175
Figure 1: Citation density for papers with and without publicly available microarray data, by year of study publication.
Wicherts et al (2011) “Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results.” PLoS ONE 6(11): e26828
Figure 1. Distribution of reporting errors per paper for papers from which data were shared and from which no data were shared.
Morin, Andrew, Jennifer Urban, and Piotr Sliz. 2012. “A Quick Guide to Software Licensing for the Scientist-Programmer.” PLoS Computational Biology 8 (7): e1002598.
From the Panton Principles:

> [In] the scholarly research community the act of citation is a commonly held community norm when reusing another community member’s work. […] A well functioning community supports its members in their application of norms, whereas licences can only be enforced through court action and thus invite people to ignore them when they are confident that this is unlikely.
Peng, R. D. “Reproducible Research in Computational Science” Science 334, no. 6060 (2011): 1226–1227
This slideshow was generated as HTML from Markdown using RStudio.
The Markdown sources, and the HTML, are hosted on GitHub: https://github.com/fmichonneau/2017-useR-reproducibility